Bandits, Experts and Games. Instructor: Alex Slivkins. Homework 1: Bandits with IID rewards
Abstract
Problem 1: rewards from a small interval. Consider a version of the problem in which all the realized rewards are in the interval [1/2, 1/2 + ε] for some ε ∈ (0, 1/2). Define versions of UCB1 and Successive Elimination that attain improved regret bounds (both logarithmic and root-T) that depend on ε. Hint: use a more efficient version of the Hoeffding Inequality from the slides of the first lecture. It is OK not to repeat all steps from the analysis in class as long as you explain which steps of the analysis change.

Comments after the due date: The confidence radius and the √T regret bound can be improved by a factor of ε. The log(T)/∆ regret bound can be improved by a factor of ε². (The ε² improvement is a little surprising; but ∆ ≤ ε, so, in some sense, one factor of ε cancels out.) An alternative solution is to transform each reward r to (r − 1/2)/ε. Then we obtain a problem with rewards in [0, 1], and regret in the original problem is ε times the regret in the transformed problem. (Also, ∆ in the original problem is ε times the ∆ in the transformed problem, hence the ε² improvement in the log(T) regret bound.) A code sketch of this rescaling argument appears below.

Problem 2: instantaneous regret. Recall: instantaneous regret at time t is defined as ∆(at).
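A minimal Python sketch of the rescaling solution. The simulation setup (two Bernoulli-style arms, the pull function, and all names) is hypothetical and only for illustration; the point is that running standard UCB1 on the transformed rewards (r − 1/2)/ε is exactly the rescaling argument above, so each unit of regret in the transformed problem costs ε in the original one.

import math
import random

def ucb1_on_transformed(pull, n_arms, T, eps):
    # Run standard UCB1 on rewards rescaled from [1/2, 1/2 + eps] to [0, 1].
    # Regret incurred here is eps times the regret of UCB1 on the transformed
    # problem, which is the rescaling argument from the comments above.
    counts = [0] * n_arms
    means = [0.0] * n_arms                 # running means of transformed rewards
    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                      # play each arm once to initialize
        else:
            a = max(range(n_arms),         # standard UCB1 index on transformed rewards
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)                        # realized reward in [1/2, 1/2 + eps]
        x = (r - 0.5) / eps                # transformed reward in [0, 1]
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]

# Hypothetical instance: two arms whose reward is 1/2 + eps with probability 0.6 and 0.4.
eps = 0.1
probs = [0.6, 0.4]
ucb1_on_transformed(lambda a: 0.5 + (eps if random.random() < probs[a] else 0.0),
                    n_arms=2, T=10_000, eps=eps)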
Similar resources
G: Bandits, Experts and Games, 10/10/16. Lecture 6: Lipschitz Bandits
Motivation: similarity between arms. In various bandit problems, we may have information on similarity between arms, in the sense that ‘similar’ arms have similar expected rewards. For example, arms can correspond to “items” (e.g., documents) with feature vectors, and similarity can be expressed as some notion of distance between feature vectors. Another example would be the dynamic pricing pro...
Lecture 11: Bandits with Knapsacks. General Framework: Bandits with Knapsacks (BwK)
The basic version of the dynamic pricing problem is as follows. A seller has B items for sale: copies of the same product. There are T rounds. In each round t, a new customer shows up and one item is offered for sale: the algorithm chooses a price pt ∈ [0, 1], and the customer, who has in mind some value vt for this item, buys it if vt ≥ pt and does not buy otherwise. ...
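As a minimal sketch, this interaction protocol can be simulated as follows (Python; the pricing rule and the customer-value distribution are placeholder assumptions, not part of the lecture):

import random

def dynamic_pricing(choose_price, customer_values, B):
    # Basic dynamic pricing protocol: B identical items, one customer per round,
    # a sale happens iff the customer's value covers the posted price.
    revenue, items_left = 0.0, B
    for t, v_t in enumerate(customer_values):
        if items_left == 0:          # supply exhausted, nothing left to sell
            break
        p_t = choose_price(t)        # algorithm posts a price in [0, 1]
        if v_t >= p_t:               # customer buys iff v_t >= p_t
            revenue += p_t
            items_left -= 1
    return revenue

# Hypothetical run: fixed price 0.5, customer values drawn uniformly from [0, 1].
values = [random.random() for _ in range(1000)]
dynamic_pricing(lambda t: 0.5, values, B=100)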
G: Bandits, Experts and Games, 11/07/16. Lecture 10: Contextual Bandits
• the algorithm observes a “context” xt,
• the algorithm picks an arm at,
• a reward rt ∈ [0, 1] is realized.
The reward rt depends on both the context xt and the chosen action at. Formally, we make the IID assumption: rt is drawn independently from some distribution that depends on the (xt, at) pair but not on t. The expected reward of action a given context x is denoted μ(a|x). This setting allows a limit...
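A minimal sketch of this interaction loop (Python); the context distribution, the expected rewards μ(a|x), and the policy are hypothetical stand-ins, used only to show the order of events and the IID assumption:

import random

def contextual_bandit_loop(policy, T):
    # Each round: observe context x_t, pick arm a_t, realize reward r_t in [0, 1].
    # Rewards are drawn IID from a distribution that depends only on (x_t, a_t).
    history = []
    for t in range(T):
        x_t = random.choice(["young", "old"])        # assumed context distribution
        a_t = policy(x_t)                            # arm chosen given the context
        mu = 0.8 if (x_t, a_t) in {("young", 0), ("old", 1)} else 0.3   # assumed mu(a|x)
        r_t = 1.0 if random.random() < mu else 0.0   # Bernoulli reward, IID given (x_t, a_t)
        history.append((x_t, a_t, r_t))
    return history

# Hypothetical policy: arm 0 for "young" contexts, arm 1 otherwise.
contextual_bandit_loop(lambda x: 0 if x == "young" else 1, T=1000)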
Lecture 2: Bandits with i.i.d. rewards (Part II)
So far we’ve discussed non-adaptive exploration strategies. Now let’s talk about adaptive exploration, in the sense that the bandit feedback from different arms in previous rounds is fully utilized. Let’s start with 2 arms. One fairly natural idea is to alternate them until we find that one arm is much better than the other, at which time we abandon the inferior one. But how to define “one arm is ...
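A minimal sketch of this two-arm idea (Python). The snippet above is truncated before the stopping rule, so the separation test used here (empirical gap exceeding twice a Hoeffding-style confidence radius) is an assumption, one standard way to formalize “one arm is much better”:

import math
import random

def alternate_then_commit(pull, T):
    # Alternate the two arms until their confidence intervals separate,
    # then commit to the empirically better arm for the remaining rounds.
    counts, means = [0, 0], [0.0, 0.0]
    committed = None
    for t in range(1, T + 1):
        a = committed if committed is not None else t % 2   # alternate 1, 0, 1, 0, ...
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        if committed is None and min(counts) > 0:
            radius = math.sqrt(2 * math.log(T) / min(counts))
            if abs(means[0] - means[1]) > 2 * radius:       # one arm is clearly better
                committed = 0 if means[0] > means[1] else 1
    return committed

# Hypothetical instance: Bernoulli arms with means 0.7 and 0.5.
alternate_then_commit(lambda a: float(random.random() < (0.7 if a == 0 else 0.5)), T=100_000)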
G: Bandits, Experts and Games, 09/19/16. Lecture 3: Lower Bounds for Bandit Algorithms
Note that (2) implies (1), since if regret is high in expectation over problem instances, then there exists at least one problem instance with high regret. Also, (1) implies (2) if |F| is a constant. This can be seen as follows: suppose we know that, for any algorithm, regret is high (say H) on one problem instance in F and low on all other instances in F; then, taking a uniform ...
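One way the truncated averaging step presumably finishes (a sketch under the snippet’s own assumptions): draw the problem instance I uniformly at random from F. Since the instance with regret at least H is drawn with probability 1/|F| and regret is nonnegative,

E[ regret of the algorithm on I ] ≥ (1/|F|) · H,

which is still on the order of H when |F| is a constant. So high regret on some instance (as in (1)) yields high regret in expectation over a randomized instance (as in (2)).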